A new method of analysis was used in the Michigan
Educational Assessment Program to test minimum competencies in fourth
grade reading achievement. This technique permitted a substantial
decrease in testing time and costs. The original test consisted of 95
items measuring 19 objectives; mastery was indicated by correct
responses to four out of the five items measuring each objective.
Data from those 1,096 students whose raw scores were between 36 and
83 were re-analyzed. Several tests were used to determine which items
were acceptable for analysis using the Rasch model. It was felt that
the 95 items fit the Rasch model, and that the item calibrations
would yield standard log achievement scores (SLAs) that would
accurately summarize where students fell on the latent trait measured
by the test. These SLAs provided essentially the same information as
the number of objectives mastered, with a shorter test. For most
students, the items were relatively simple given student achievement
levels; therefore, the amount of information provided was slight. A
short, ten-item test was developed; analysis indicated that it
imposed a more rigid criterion than the longer test. Mastery
decisions using the short test led to 14.9% fewer Type II errors
(false negatives) and 1.63% more Type I errors (false positives).
(GDC)
Introduction

For too long, many in the educational community have ignored the
information that has been available to help make informed educational
decisions. It was not uncommon for the results of the administration of
batteries of tests to gather dust in offices and never be consulted before
educational plans were formulated. Recent pressures for improving the
delivery of basic skills in public education, occurring as they do at a
time when the amount of monies available for public education seems to be
decreasing, have intensified the needs of educational decision-makers for
more effective educational plans. Test results provide one important source
of information to make better educational plans. However, even though the
use of such data is increasing, the administration of tests and the
analysis of results are expensive in terms of instructional time and dollar
costs. This paper will present an introduction to a new method of test
analysis and its application in creating an alternative to current testing
practice in State Assessment. This alternative seems to permit a
substantial decrease in total testing time and costs without substantial
loss in the information provided.
The Michigan Educational Assessment Program

For the past nine years the Michigan Educational Assessment Program
(MEAP) has endeavored "to provide information on the status and progress
of Michigan basic skills education" to state and local educational
decision-makers and their clients. The assessment is carried out by
administering objective-referenced tests in "an important, but limited
number of minimal skills in reading and mathematics" at grades four and
seven throughout the state. The results of these tests "provide for
standard measurement of all pupils and help control personal bias and
arbitrary judgments by educational decision-makers." In addition, test
results provide input "when curricula (sic) decisions are being made by
curriculum specialists, both at the state and local levels." Moreover,
MEAP test results are used "to identify high needs schools" so that the
state can "initiate contacts with local school districts and offer to help
them in addressing the achievement problems there." More generally, "it is
considered appropriate for the state to use MEAP test results as part of
the process for allocating state funds." Since so many important
educational programs and individual student decisions are based upon the
results of these tests, a full understanding of the composition and
performance of these tests is crucial.

In order to enable users to understand the characteristics of the MEAP
tests, the Research, Evaluation and Assessment Services of the Michigan
State Department of Education (MDE) publishes a comprehensive Technical
Report (MEAP, 1976) which includes various item and objective statistics.



The quotations that appear in this paragraph are taken from a
pamphlet published by MEAP, entitled "Do YOU Use MEAP Tests Appropriately?"
The pamphlet is distributed to local district users of the test results
to assist in the appropriate use and interpretation of MEAP test data.

The State has been piloting experimental versions of first and
tenth grade MEAP tests.
Validity, reliability and item discrimination measures are provided.
These data indicate that the tests perform acceptably as tests. The
purpose of this paper will be to report an examination of one of the tests,
Grade Four MEAP Reading, under the assumptions of the Rasch model. We report
these results not to be critical of the current procedures used by the
Michigan Educational Assessment Program but to explore additional procedures
and techniques of analyzing the tests and reporting the results of
statewide assessment.

The Grade Four MEAP Reading Test was chosen for the present analysis.
The test consists of 95 items which measure 19 fourth grade reading
objectives. There are five items for each objective on the test; mastery
is reached if the student correctly answers 4 of 5 items. According to
the State Summary, 61 percent of the pupils mastered 75 percent of the
objectives statewide. No item data were reported in the summary.
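The mastery rule described above (4 of 5 items per objective, and the 75 percent summary over 19 objectives) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the layout of the response vector (items stored objective by objective, five at a time) is an assumption, since the paper does not describe the item order.

```python
from typing import List

ITEMS_PER_OBJECTIVE = 5  # stated in the text
MASTERY_THRESHOLD = 4    # 4 of 5 correct, stated in the text


def objectives_mastered(responses: List[int]) -> int:
    """Count mastered objectives from a 0/1 response vector.

    Assumption: items are grouped objective by objective, five at a time.
    """
    if len(responses) % ITEMS_PER_OBJECTIVE != 0:
        raise ValueError("expected five items per objective")
    mastered = 0
    for start in range(0, len(responses), ITEMS_PER_OBJECTIVE):
        block = responses[start:start + ITEMS_PER_OBJECTIVE]
        if sum(block) >= MASTERY_THRESHOLD:
            mastered += 1
    return mastered


def meets_75_percent(responses: List[int]) -> bool:
    """True when at least 75 percent of the objectives are mastered."""
    n_objectives = len(responses) // ITEMS_PER_OBJECTIVE
    return objectives_mastered(responses) >= 0.75 * n_objectives


# A hypothetical student: four of five correct on 15 objectives,
# nothing correct on the remaining four (raw score 60).
student = [1, 1, 1, 1, 0] * 15 + [0] * 20
print(objectives_mastered(student), meets_75_percent(student))
```

With 19 objectives, 75 percent corresponds to 14.25, so in practice the summary criterion is mastery of 15 objectives.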

The MEAP state sample tape provided a random sample of the results of
the Grade Four Reading Test for approximately 5000 fourth grade students
in Michigan. A random half of these students were selected for the present
analysis, yielding a case base of 2568 subjects. For the students in this
analysis, the mean number of objectives mastered was 13.6 and the median
was 16.2 objectives attained. The mean number of items answered correctly
by this sample group was 74.8 and the median was 83.0 items.
In a series of four articles, Rudman (1977a,b,c,d) has offered some
criticism of the tests based upon his analysis of the traditional measures
of test statistics. Our purpose here prevents us from exploring either
his or other criticism of MEAP tests.

We wish to thank Research, Evaluation and Assessment Services of the
Michigan Department of Education for making available the data which
supported this analysis.

The latest available Technical Report (MEAP, 1976:18-25) reports
reliability and item discrimination measures for each objective considered
as a five item test. These data indicate that eleven (11) of the objectives
have KR-20 Reliability Coefficients greater than .70 and one is below .49.
The median phi coefficient for the association between objective and item
attainment was .88.

The SAMPLE procedure provided by SPSS was used to select a random
half (Nie and others, 1975:127-8).



Rasch Models

Recent developments in latent trait theory have occasioned a renewed
interest in "true score theory" (Lord and Novick, 1968). Under the
leadership of Wright and his students (Wright, 1968, 1977; Wright and
Panchapakesan, 1969; Wright and Mead, 1977), a latent trait model
originally proposed by Georg Rasch (1960, 1966) has caught the attention of
the educational measurement community (see, for example, Journal of
Educational Measurement 14 (Summer), 1977). Under the assumptions of the
Rasch model, an individual's score is "governed by the product of the
ability [achievement level] of the person and the easiness of the item"
(Wright, 1968:4). The equation which specifies the relationship between the
achievement level of the subject and the difficulty of the item can be
written:

$$P_{vi} = \frac{e^{AL_v - D_i}}{1 + e^{AL_v - D_i}}$$

where $P_{vi}$ is the probability that person v correctly answers item i,
$AL_v$ is the achievement level of person v, and $D_i$ is the estimate of
the difficulty of item i. AL is the Rasch standard achievement score
expressed in log achievement units and D represents the Rasch log item
difficulty score.
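As a minimal sketch of this relationship, the logistic form in which the Rasch model is usually written in log units, P = exp(AL - D) / (1 + exp(AL - D)), can be evaluated directly. The function name is hypothetical and the values in the usage lines are made up for illustration.

```python
import math


def rasch_probability(al: float, d: float) -> float:
    """Probability that a person with achievement level `al` answers an
    item of difficulty `d` correctly (both expressed in log units)."""
    return 1.0 / (1.0 + math.exp(-(al - d)))


# When achievement equals difficulty the chance of success is one half;
# it rises as achievement exceeds difficulty and falls when it is lower.
for al in (-1.0, 0.0, 1.0):
    print(al, round(rasch_probability(al, 0.0), 3))
```

Because only the difference AL - D enters the formula, persons and items are placed on the same scale, which is the property the following paragraphs rely on.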

These parameters are estimated from the distribution of raw scores and the
P values of the items comprising the test. The result of fitting a set



Rentz and Bashaw (1977:161) note that reference to "ability" sometimes
causes confusion which is unnecessary "if one is aware that 'ability' as used
here is a generic term that means the trait or characteristic of the examinee
being measured by the particular test under consideration."

Birnbaum (1968:402) notes that the Rasch model "is a special case of the
logistic model in which all items have the same discriminating powers, and all
items can vary only in their difficulties." Hambleton and Traub (1970)
demonstrated that some information is lost by not fitting additional
parameters. However, they note that a considerable cost, in both expense and
clarity, is incurred by fitting additional parameters.



of items or persons to this model is an interval measure of achievement
and item difficulty in terms of the same units. This facilitates examinations
of test items, student performance on tests, and instructional content that
were impossible under traditional measurement techniques. These features
result in "person-free" and "test-free" measurement.
Tinsley and Dawis (1972) demonstrate that decisions about items do
not differ markedly under Rasch techniques and traditional techniques for
choosing test items. The case is that Rasch techniques allow for the
selection of items as efficiently as traditional techniques in addition to
providing additional measurement power.

There is some dispute with regard to the extent to which measurement
is "person free" and/or "test free." However, the model has been found
to be relatively robust under violation of assumptions given large enough
sample sizes and "fitting" items (Tinsley and Dawis, 1972).



Test Construction

An initial calibration of the 95 items which constitute the Grade
Four MEAP Reading Test was performed including only those students whose
raw score was between 36 and 83. These score boundaries were chosen with
reference to the chance of a student correctly answering an item by guessing
if there are four choices for each item. The score interval runs from one
and a half times the chance level at the low end of the distribution
(1.5 x 24) to the maximum score less one half the chance level at the top
end (95 - (.5 x 24)).
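The arithmetic behind these boundaries can be checked with a short sketch. The variable and function names are hypothetical; the only inputs are the figures given in the text (95 items, four choices per item, and a chance score rounded to 24).

```python
N_ITEMS = 95
N_CHOICES = 4  # the text assumes four choices per item

# Expected score from blind guessing: 95 / 4 = 23.75, which the text rounds to 24.
chance_score = round(N_ITEMS / N_CHOICES)

lower_bound = int(1.5 * chance_score)            # 1.5 x 24
upper_bound = int(N_ITEMS - 0.5 * chance_score)  # 95 - (.5 x 24)


def in_calibration_window(raw_score: int) -> bool:
    """True when a raw score falls inside the calibration interval."""
    return lower_bound <= raw_score <= upper_bound


print(chance_score, lower_bound, upper_bound)
```

The sketch treats both boundaries as inclusive, which matches the later description of excluded students as those scoring below 36 or above 83.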

In the initial calibration, 1438 subjects were excluded because their
scores were outside the specified range (224 students received a raw score
below 36; 1214 students scored above 83), and a further 34 students were
immediately excluded because they achieved perfect scores. The first
calibration was therefore performed on 1096 of the 2568 students. The mean
number of items correctly answered by this group was 54.8 and the median
was 59.2. The mean standard
One of the assumptions of the Rasch model is that the items are drawn
from a homogeneous domain of content. Although the model seems to be robust
under violations of this assumption, a factor analysis of the 95 items on
the Grade Four MEAP Reading Test was performed to test the dimensionality
of the item set. This analysis yielded only one factor with an eigenvalue
greater than one and explained 61 percent of the variance among the 95 items.
This seems to be reason to believe that the 95 items lie along a single
dimension.

Cypress (1973:4) found that "the estimates derived from the Rasch
Measurement Model were not independent of the group used to produce them.
Differences were minimal in the middle score range, but large in high and low
score ranges." More stable estimates would seem obtainable from subjects in
the middle range of scores, those within the boundaries which we established.

Justification for the choice of these boundaries rests in our intuitive
unwillingness to believe that low scoring students are "informed guessers."
Moreover, the middle range of scores seems to provide more stable information
about item difficulty estimates. Robert Rentz, in a personal communication
with the authors, stated that our procedures are perhaps more rigorous than
necessary and that we may be too willing to believe the tests of fit provided
by the model. Rentz prefers to calibrate on all persons. One of the
perplexing aspects of work with the Rasch model is the unavailability of
any good decision rules for procedural issues.



achievement estimate for the students in the calibration was 219.5 with a
standard deviation of 14.8. The results of this calibration were examined
to determine how well this set of items "fit" the model.

There is no single statistic which measures the fit of a set of items
to the Rasch model. Therefore, we used a series of "tests" to determine
whether the 95 items for fourth grade reading performed acceptably. First,
we examined the Total Fit Mean Square (FMS) which is computed for each of
the 95 items. This statistic represents the mean squared standard residual
between how an individual person of a given achievement level performed on
items and how he/she could be expected to perform given the difficulty of
the item, averaged over persons. Wright and Mead (1977:50) suggest that
this statistic "will be large for an item if there are too many high ability
persons who failed on an item and/or too many low ability persons who
succeeded." These values averaged over items yield a summary "fit
statistic." The value for this statistic obtained from the initial
calibration of 95 items was .97 with a standard error of .166. We know that a
standard error as high as .20 has been obtained in simulated data that fit
the model and so a value of .166 does not seem "too large."
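To make the idea of a mean squared standard residual concrete, the sketch below computes an unweighted fit mean square for a single item. It is an illustration under stated assumptions, not a reproduction of BICAL: the function names are hypothetical, person abilities and the item difficulty are taken as already estimated in log units, and any degrees-of-freedom adjustment the program may apply is omitted.

```python
import math
from typing import List


def rasch_probability(al: float, d: float) -> float:
    """Rasch probability of a correct response (log units)."""
    return 1.0 / (1.0 + math.exp(-(al - d)))


def item_fit_mean_square(responses: List[int],
                         abilities: List[float],
                         difficulty: float) -> float:
    """Average squared standardized residual for one item over persons.

    Each residual is (observed - expected) divided by the binomial
    standard deviation sqrt(p * (1 - p)) implied by the model.
    """
    if len(responses) != len(abilities) or not responses:
        raise ValueError("need one ability per response")
    total = 0.0
    for x, al in zip(responses, abilities):
        p = rasch_probability(al, difficulty)
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)


# Two high-ability persons failing an easy item inflate the statistic,
# which is the pattern Wright and Mead describe.
print(item_fit_mean_square([0, 0], [3.0, 3.0], 0.0))
```

Values near 1.0 indicate responses about as variable as the model predicts; unexpected failures by able persons or successes by less able persons push the value upward.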

Another indicator of test fit is the ratio between the observed standard
error of Total FMS and the one expected given the assumptions of the model
over the particular set of items on which the calibrations were done. The
expected standard error of FMS in these data was .043. Our procedures
include the computation of the ratio between the observed standard error of
FMS and the expected standard error, the value of which in this case was
3.88. Again, although there are no "rules" for assessing the magnitude of
this number, experience indicates that a value of 3.00 or less is desirable.
Therefore, the ratio



The original Rasch achievement scores are in log units with a mean of 0
and a standard deviation of 1. We have followed standard practice and
transformed these scores to a distribution with a mean of 200 and a standard
deviation of 10.
of observed to expected FMS is more elevated than we would like it to be
before we are willing to believe that the items fit the model.

Finally, the BICAL program (Wright and Mead, 1977) routinely computes
the item characteristic curve for each of six different score groups, ranging
from extremely low scorers to extremely high scorers. How well the individuals
in each of these different score groups perform on the items is measured
by a Group Mean Square (GMS) and its standard deviation. The standard
deviations may be treated as a chi-square with one degree of freedom (Wright
and Mead, 1977:37-39). Table 1 displays the GMS and standard deviations for
each of the separate score groups. The critical value for chi-square with
1 df at .01 is 6.6. Therefore, from Table 1, we see that the subjects in the
lowest and the highest score groups differ significantly in their performance
with respect to the model. The distributions of the item statistics that were
produced by the initial calibration are displayed in Table 2.
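The critical value quoted above can be checked without statistical tables. The sketch below uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper tail can be written with the complementary error function. The function name is hypothetical.

```python
import math


def chi_square_1df_upper_tail(x: float) -> float:
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    Since X = Z**2 with Z standard normal, P(X > x) = P(|Z| > sqrt(x)),
    which equals erfc(sqrt(x / 2)).
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))


# The tail probability at the quoted critical value is close to .01.
print(chi_square_1df_upper_tail(6.6))
```

A group statistic larger than 6.6 therefore has a tail probability of roughly .01 or less under the model, which is the comparison applied to the six score groups.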

We do not want to make too strong a claim about the exact distribution
of these numbers. However, the values of the standard deviations in the
extreme groups look sufficiently different from the values in the middle
four groups for us to wonder about how well the items fit.
TABLE 1

Mean Squares and Standard Deviations for Six
Score Groups on Initial Calibration of 95 Item Test

TABLE 2

Item Fit Statistics of 95 Items of
Grade Four MEAP Reading Test
Rasch Scores and Number of Objectives Mastered

In the preceding section we explored the extent to which the 95 items
that comprise the Grade Four MEAP Reading Test "fit" the Rasch model.
Our conclusion was that the items fit reasonably well and that the
calibrations of the items would yield standard log achievement scores (SLAS)
that would accurately summarize where students fall on the latent trait
measured by the fourth grade reading achievement test. In this section we
will explore how these SLAS are related to other summary measures of student
achievement that are currently reported from the results of MEAP testing. We
will attempt to show that SLAS provide essentially the same information as
number of objectives mastered. In the following section, however, we will
demonstrate that SLAS allow for the creation of instruments which can provide
for a substantial saving in testing without a loss of information.

One summary measure which enjoys wide use (despite the disclaimers of
educators responsible for MEAP) is the proportion of students who master
75 percent of the 19 reading objectives. Many see this statistic as an
overall picture of the general level of reading. If a sufficient number
of students master 75 percent of the objectives, a reading program is thought
to be doing an adequate job of delivering "minimal skills." If the proportion
of students mastering 15 objectives falls below a certain level, the district
may qualify for additional funds to support improving the delivery of those
"minimal skills." In the face of opposition to the use of such measures



We realize that there is important information about the performance
of students on discrete reading objectives which is not captured in any
summary statistic and that this information is important in making
instructional decisions at the district, building and student level. We do
not argue that summary measures can replace such data. However, another
analysis by the authors (in preparation) will examine the utility of Rasch
scores to reproduce the information contained in the mastery of discrete
objectives and indicate ways in which tests can be redesigned to improve the
quality of information about students' achievement with reference to discrete
reading skills.
derived from objective-referenced tests, single number summaries are provided
and are used to support educational policy decisions. It seems reasonable,
then, to compare the performance of SLAS to the number of objectives mastered
in order to determine if the Rasch-derived scores provide at least as much
information as number of objectives mastered. Any new summary measure ought
to work at least as well as the one it replaces.

There is a high positive correlation between SLAS and number of objectives
mastered (r = .93). Decisions tend not to be based upon the entire range of
the numbers of objectives mastered but to be concentrated at that point which
seems intuitively to indicate mastery in a more global sense, that is, at 75
percent. Therefore, one way to examine the relationship between SLAS and
number of objectives mastered is to establish a criterion level for SLAS
which is comparable to mastering 15 reading objectives. Two considerations
guided our selection of a SLAS criterion score. First, we noted that MEAP
defines mastery of each objective at four correct of the five items which
comprise the objective, that is, 80 percent. Second, in other applications
of the Rasch technique to criterion-referenced tests (Kifer and Bramble, 1974),
the SLAS which corresponded to correctly answering 80 percent of the items on
the test was applied. Therefore, we chose to set the SLAS criterion score at
216, the score which students who answered 76 items correctly received. The
question we now examine is whether we would make the same mastery decisions
about students using a SLAS criterion score of 216 as we would using mastery
of 75 percent of the 19 reading objectives at grade four.

Students in the sample were coded into two groups: those who mastered
15 or more objectives and those who mastered 14 or fewer. A distribution
of SLAS was prepared for each group. These distributions appear in Table 3.
We see that the SLAS distributions are considerably different between
masters and non-masters. The median SLAS for those students who mastered 14
or fewer objectives is 206 as compared to a median of 229 for those who
mastered 15 or more reading objectives. Clearly the distributions are
different and the SLAS criterion score seems to sort students into mastery
groups that have a similar composition to groups selected on the basis of
mastering 75 percent of the objectives. The data summarized in Table 4
present the similarities more explicitly. The cross tabulation of the two
criteria for mastery shows that in the overwhelming majority (94.9 percent)
of cases, each criterion yields the same decision about the mastery level of
the student. Over one third (34.1 percent) of the sample fail to master 15
objectives and score below 216; three-fifths (60.8 percent) master at least
15 objectives and score 216 or higher. For about one student in twenty (4.8
percent), however, a score of 216 or higher is obtained even though they do
not master at least 15 objectives. We suspect that these students either
consistently master three of the five items in the objectives or master five
of five for a limited number of objectives. In either case, their SLAS will
be higher because of the relationship between SLAS and raw score dictated by
the Rasch model. Whether or not these students constitute Type II errors
(false negatives) need not concern us here. We simply note that this type
of error has been traditionally deemed acceptable because all the student
risks is additional instruction. Whatever the reason for the difference in
classification from the different criteria, these cases are relatively rare.
Even rarer are those students who master 15 objectives but score below 216.
These are probably the students who consistently master only four of the
TABLE 3

Relative Frequency Distributions of Standardized Log Achievement
Scores (SLAS) of Students Who Met and Who Did Not Meet MDE Criterion
of Mastery of Fifteen Objectives on Grade Four MEAP Reading Test

SLAS              NON-MASTERS     MASTERS

231 thru 258           -            40.7
230
229                    -             9.0
228                    -             7.8
227
226                    -             5.8
225                    -             5.2
224                    -             5.5
223                    -             5.5
222                   0.4            4.3
221                   0.4            4.3
220                   0.4            4.1
219                   1.1            2.9
218                   2.4            1.8
217                   5.3            2.2
216                   2.3            0.3
215                   3.0            0.3
214                   5.4            0.1
213                   2.8            0.1
212                   4.9             -
211                   5.1             -
210                   5.0             -
162 thru 209         61.4             -

TOTAL PERCENT        99.9           99.9
MEAN                204.0          230.2
SD                   10.2            9.0
MEDIAN              205.8          228.?
(N)                 (998)         (1570)
TABLE 4

Relationship Between Mastery of 75 Percent of
Grade Four MEAP Reading Objectives and Standardized
Log Achievement Criterion Score Levels

(Percent of Total)

Standardized Log      Master 14      Master 15
Achievement Score     or fewer       or more        Total
                      objectives     objectives

LE 215                  34.1            .4           34.4
                       (875)           (9)          (884)

GE 216                   4.8           60.8          65.6
                       (123)         (1561)        (1684)

Total                   38.9           61.1         100.0
                       (998)         (1570)        (2568)
five items for each objective and they constitute only about one half of
one percent. The point bi-serial correlation between these two criterion
variables is .89. We feel safe in concluding that using a SLAS criterion
score of 216 enables us to make essentially the same mastery decisions as
a mastery decision using 75 percent of the objectives.
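The agreement figures can be recomputed from the cell counts reported in Table 4. The sketch below is an illustration with hypothetical variable names; it uses the phi coefficient, which is the Pearson correlation between two dichotomous variables, as the measure of association for the two-by-two table.

```python
import math

# Cell counts from Table 4.
low_nonmaster = 875    # SLAS of 215 or less, 14 or fewer objectives
low_master = 9         # SLAS of 215 or less, 15 or more objectives
high_nonmaster = 123   # SLAS of 216 or more, 14 or fewer objectives
high_master = 1561     # SLAS of 216 or more, 15 or more objectives

n = low_nonmaster + low_master + high_nonmaster + high_master

# Proportion of students classified the same way by both criteria.
agreement = (low_nonmaster + high_master) / n

# Phi coefficient for the two-by-two table.
row_low = low_nonmaster + low_master
row_high = high_nonmaster + high_master
col_nonmaster = low_nonmaster + high_nonmaster
col_master = low_master + high_master
phi = (low_nonmaster * high_master - low_master * high_nonmaster) / math.sqrt(
    row_low * row_high * col_nonmaster * col_master
)

print(n, round(100 * agreement, 1), round(phi, 2))
```

The recomputed agreement rounds to the 94.9 percent quoted in the text, and the phi coefficient rounds to .89.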

What may be more informative than this summary discussion is the behavior
of the SLAS distribution over the restricted range where mastery decisions
are most difficult. Table 3 indicated that mastery decisions are essentially
being made in the score range 213 to 222. No student who mastered 15 or more
objectives scored lower than 213 and no student who mastered at most 14
objectives scored higher than 222. It is in this region of "overlap" where
precise measurement is most desirable. We note that the Rasch model is most
efficient when the achievement levels of the subjects are matched to the
difficulty level of the items measuring their achievement. Less than 10 percent
of the items from this test calibrate at the difficulty level which is near the
region of "overlap" of these distributions. There are only eight items on the
entire test with log item difficulties greater than 212 and only three of
these items have difficulties greater than 216.
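The claim that measurement is most efficient when item difficulty matches achievement can be illustrated with the statistical information an item contributes under the Rasch model, P(1 - P), which peaks when the two are equal. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the conversion SLAS = 200 + 10 x (log units) is inferred from the footnoted rescaling to a mean of 200 and a standard deviation of 10.

```python
import math


def slas_to_logits(slas: float) -> float:
    """Convert a standardized score back to log units.

    Assumption: SLAS = 200 + 10 * logits.
    """
    return (slas - 200.0) / 10.0


def item_information(person_slas: float, item_slas: float) -> float:
    """Rasch item information P * (1 - P) for a person and an item."""
    diff = slas_to_logits(person_slas) - slas_to_logits(item_slas)
    p = 1.0 / (1.0 + math.exp(-diff))
    return p * (1.0 - p)


# A student at the 216 criterion gains the most from an item calibrated
# at 216 and progressively less from easier items.
for item in (216, 212, 206, 196):
    print(item, round(item_information(216, item), 3))
```

Under these assumptions an item two log units easier than the student's level carries well under half the information of a matched item, which is why a test dominated by easy items says little about students near the criterion.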



1 It is possible to master 15 objectives with a SLAS of 206, corresponding
to a raw score of 60. In these data, the lowest SLAS achieved by students
who mastered 15 objectives was 213.

2 We believe that mastery decisions about students at the extremes of the
distribution are relatively easier than those about students in the middle of
the distribution. Table 3 indicated the "lumping" that occurs at the extremes.
Over three-fifths of the non-masters (61.4 percent) fall in the first quartile
of the total distribution of SLAS; two-fifths (40.7 percent) of the masters
fall within the top quartile. Moreover, there are no masters in the lowest
quartile or non-masters in the top quartile.

3 Seven of the ten most difficult items on the test appear after test
question number 88, suggesting that test order may be contributing to their
difficulty level. We have not checked the rates of noncompletion for these
items at this writing.

It is important to remember that the MEAP tests are designed to
measure "minimal competencies." The fact that the competencies covered
in the fourth grade reading test may be somewhat below what constitutes
a typical fourth grader's battery of reading skills is indicated by the
fact that the average achievement level of the students in our sample
was 220 and the median was 222. These scores are considerably above the
200 average imposed by the calibration technique. What is troubling
is the fact that so many students (65 percent score above 216) must
take so many items that are so easy for them, resulting in scores that
are of practically no instructional value, regardless of how they are
reported.

In this section, we have demonstrated the essential similarity
between the decisions about the mastery of students on the Grade Four
MEAP Reading Test using the Rasch model-derived SLAS and 75 percent of
the objectives mastered. For the majority of students, we found that
the items were relatively easy given their achievement levels and that
the amount of information available for instructional purposes was slight.
In the following section, we explore an alternative to current testing
practice which promises a significant reduction in the amount of testing
without a loss in the information provided by current summary statistics.
The Rasch Model and Short Tests

The Rasch model offers a unique solution to the problem of statewide
assessment of "minimal competencies." Under the assumptions of the
Rasch model, measurement can be "test free." It is not necessary to
administer all items to all students in order to make statements about
whether the students have mastered certain "minimal competencies,"
whether in terms of 19 (or 100) reading objectives or in terms of 95
(or 10,000) reading items. A student who receives a SLAS of 216 has met
the criterion in terms of the content measured by Grade Four MEAP Reading.
The power of the Rasch model lies in its ability to allow us to determine
a student's SLAS by administering considerably fewer than 95 items. Once
the items (or objectives) have been calibrated, that is, assigned a known
difficulty level in relation to all the other items in the test, all the
items need not be administered to determine how students will perform on
the skills that they measure. The Rasch model allows the educator to measure
skills without directly testing for them.

In order to determine empirically the ability of a short test to
provide the same mastery information about students as longer tests, we
developed a ten item test of fourth grade reading. The items were
selected on the basis of the calibrations of items for the 95 item test.
The items and their difficulty estimates are listed in Table 5.



Brink (1972) demonstrated that since "the Rasch model scales items
on easiness and subjects on achievement level," while "the Guttman model
orders items on difficulty and the subjects on total score," the Guttman
model "does not possess the precision that may be possessed by a Rasch
scale." We will not examine the underlying scalability of the 95 items on
the Grade Four MEAP Reading Test in this paper but will simply alert the
reader to this property of the model.
The procedure used to identify items for the ten item test involved
the identification of the 20 items with the highest difficulty estimates
on the 95 item test. This list was then examined for those items with
the best fit statistics, those primarily with FMS close to 1.00. Although
the objectives with which the items were associated were not considered in
their selection, we noted that seven objectives contributed items to the
test, with one objective alone contributing three items: Objective No. 11
(see Appendix A). We noted also that five of these items are in the last
15 items that were administered, but MDE assures its users that the test is
not speeded. Our choice of the twenty most difficult as the basis for the
test rests on the consideration of the general level of easiness of the
items relative to the subjects taking the test.
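The two-step selection described above can be sketched as follows. This is a simplified illustration rather than the authors' procedure: the paper says the shortlist was examined for items "primarily" with FMS close to 1.00, which implies some judgment, whereas the sketch applies a strict ranking by distance from 1.00. The function name, the tuple layout, and the example items are all hypothetical.

```python
from typing import List, Tuple

# (item name, log difficulty, fit mean square)
Item = Tuple[str, float, float]


def select_short_test(items: List[Item],
                      pool_size: int = 20,
                      test_length: int = 10) -> List[Item]:
    """Pick the best-fitting items from among the most difficult ones."""
    # Step 1: shortlist the items with the highest difficulty estimates.
    hardest = sorted(items, key=lambda it: it[1], reverse=True)[:pool_size]
    # Step 2: within the shortlist, prefer fit mean squares nearest 1.00.
    return sorted(hardest, key=lambda it: abs(it[2] - 1.0))[:test_length]


# Made-up calibration results for five items.
example = [
    ("a", 0.10, 1.00),
    ("b", 1.50, 1.40),
    ("c", 1.20, 1.02),
    ("d", 0.90, 0.95),
    ("e", 2.00, 1.10),
]
print([name for name, _, _ in select_short_test(example, pool_size=3, test_length=2)])
```

Restricting the pool to the hardest items before looking at fit reflects the stated rationale: most of the 95 items are easy relative to the students, so difficulty is screened first.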

Having selected the items for the short test, we again set the
criterion for mastery at 80 percent of the ten items and assigned a SLAS
criterion score of 226 to the mastery decision. We then arrayed the
results of this sorting by percent mastery and SLAS on the 95 item test.
We shall first consider the relationship between SLAS on the 95 item test
and SLAS on the short test. Table 6 reports the results of the comparison
of mastery according to a SLAS of 216 on the 95 item test and a SLAS of 226
on the ten item test. We see that there is a high correlation between the
two criteria (r = .67). The mastery decisions agree in four fifths (81.0
percent) of the cases: about a third (33.8 percent) score below 216 on
the 95 item test and below 226 on the 10 item test; almost one half (47.2
percent) score above the respective SLAS criterion scores on both tests.
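The agreement figure and the correlation can be recomputed from the four cell counts in Table 6. The sketch below assumes that the reported r is a phi coefficient (a Pearson correlation between two dichotomous mastery decisions); the paper does not say how r was obtained, so that reading is an inference. The function and variable names are illustrative.

```python
from math import sqrt


def agreement_and_phi(a, b, c, d):
    """Return (proportion of agreeing decisions, phi coefficient).

    a = below criterion on both tests    b = below on long, above on short
    c = above on long, below on short    d = above criterion on both tests
    """
    n = a + b + c + d
    agreement = (a + d) / n
    phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return agreement, phi


# Cell counts as read from Table 6.
agreement, phi = agreement_and_phi(868, 16, 473, 1211)
print(f"agreement = {agreement:.1%}, phi = {phi:.2f}")
```

Under that assumption the counts give 81.0 percent agreement and a coefficient of .67, matching the values in the text.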



The SLAS criterion score is substantially higher for the short test
since the average item difficulty is substantially higher. Techniques for
equating the different length tests allow direct comparison of the
performance of students on either form of the test (Rentz and Bashaw, 1975;
Brigman and Bashaw, 1976).



TABLE 5

Item Statistics for Items Included in 10 Item Reading Test

Item       Item        Disc.       Fit Mean
Name       Diff.(a)    Index(b)    Square(c)

I77          .96        1.14         .97
I89         1.22        1.22         .97
I82          .79        1.10         .98
I100         .81         .95        1.03
I99         1.44         .90        1.06
I34         1.14         .81        1.07
I97         1.36         .74        1.10
I94         1.27         .73        1.11
I62         1.31         .46        1.16
I90         1.85         .58        1.18

(a) Rasch Log Item Difficulty estimates from 95 item calibration.
(b) Discrimination Index estimated from 95 item calibration.
(c) Fit Mean Square estimated from 95 item calibration.
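To see why a mastery criterion of 8 correct out of 10 on these items corresponds to a high achievement level, one can solve for the Rasch ability at which the expected score on the ten items equals 8. The sketch below uses the difficulties read from Table 5 and a Newton-Raphson iteration; it works in logits only, since the transformation from logits to SLAS is not given in this section, and it is not claimed to be the routine used for the paper. Function names are illustrative.

```python
from math import exp

# Rasch log difficulties of the ten items, as read from Table 5.
DIFFICULTIES = [0.96, 1.22, 0.79, 0.81, 1.44, 1.14, 1.36, 1.27, 1.31, 1.85]


def expected_score(theta, difficulties):
    """Expected number correct for ability theta under the Rasch model."""
    return sum(1.0 / (1.0 + exp(d - theta)) for d in difficulties)


def ability_for_raw_score(raw, difficulties, tol=1e-10, max_iter=100):
    """Newton-Raphson solution of expected_score(theta) == raw (in logits)."""
    n = len(difficulties)
    if not 0 < raw < n:
        raise ValueError("raw score must lie strictly between 0 and n items")
    theta = sum(difficulties) / n          # start at the mean item difficulty
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + exp(d - theta)) for d in difficulties]
        residual = raw - sum(probs)        # observed minus expected score
        information = sum(p * (1.0 - p) for p in probs)
        step = residual / information
        theta += step
        if abs(step) < tol:
            break
    return theta


mean_difficulty = sum(DIFFICULTIES) / len(DIFFICULTIES)
theta_8 = ability_for_raw_score(8, DIFFICULTIES)
print(f"mean item difficulty = {mean_difficulty:.3f} logits")
print(f"ability at 8 of 10   = {theta_8:.3f} logits")
```

The estimate for a raw score of 8 lies well above the mean difficulty of the ten items, which is the sense in which a harder item set raises the criterion score.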
What is particularly interesting is that although about one fifth (18.4
percent) of the students did not meet the criterion on the short test but
did meet the criterion of the 95 item test, less than one percent (0.6
percent) passed the short test and failed the longer version. The short
test seems to impose a more rigid criterion than the longer test.

To make sense of the pattern in the off-diagonal cells in Table 6
we must consider the kinds of error that may be involved in making mastery
decisions about students. Type I errors, false positives, involve
deciding that a student has mastered the content tested when, in fact, he/she
has not. Type II errors, false negatives, involve deciding that the student
has not mastered the content when, in fact, he/she has. If we assume that
the results of the 95 item test are more believable and accept that
distribution as our picture of what is the case, the short test has caused 16
Type I errors and 473 Type II errors. If we, in addition, assume that
Type I errors are more serious since the cost may include deciding not to
provide additional instruction where it is needed, we find that the short
test performed exceptionally well. Using one-tenth the amount of testing,
there were almost no false positives. If the short test were used as a
screening device for more exhaustive testing, the 473 Type II errors
would be identified and corrected. Further, if the purpose of additional
testing was diagnostic, almost half the students could be exempted.
The consequent reduction in interference with instruction and cost of
administering tests would be considerable. At least insofar as the 95
item test represents a student's "true" level of reading skill, the short
test would seem to perform adequately for making student mastery
decisions.
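The error rates and the screening argument can be checked directly against the counts in Table 6. The sketch below treats the 95 item decision as the reference, as the text does; the variable names and the layout of the summary are illustrative choices.

```python
# Counts as read from Table 6, with the 95 item test taken as the reference.
true_negatives = 868     # below criterion on both tests
false_positives = 16     # Type I: passed short test, failed long test
false_negatives = 473    # Type II: failed short test, passed long test
true_positives = 1211    # above criterion on both tests

total = true_negatives + false_positives + false_negatives + true_positives


def pct(count, base=total):
    """Percentage of `base`, rounded to one decimal place."""
    return round(100.0 * count / base, 1)


summary = {
    "type_I_pct": pct(false_positives),
    "type_II_pct": pct(false_negatives),
    # Screening use: students below the short-test criterion take the long test.
    "retested_pct": pct(true_negatives + false_negatives),
    "exempted_pct": pct(true_positives + false_positives),
    # Share of exempted students who would have failed the long test.
    "exempted_in_error_pct": pct(false_positives,
                                 base=true_positives + false_positives),
}

for name, value in summary.items():
    print(f"{name:>22}: {value}")
```

On these counts 47.8 percent of students would be exempted from further testing, and roughly 1.3 percent of those exempted would have fallen below the criterion on the 95 item test.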




TABLE 6

Relationship Between Standardized Log Achievement
Criterion Scores on 95 Item and 10 Item Grade Four
MEAP Reading Tests

(Percent of Total)

Standardized Log          Standardized Log Achievement Score
Achievement Score                   10 Item Test
95 Item Test            LE 225        GE 226        Total

LE 215                   33.8           0.6          34.4
                        (868)          (16)         (884)

GE 216                   18.4          47.2          65.6
                        (473)        (1211)        (1684)

Total                    52.2          47.8         100.0
                       (1341)        (1227)        (2568)




A similar result emerges when mastery decisions based upon the SLAS
score on the ten item test are compared to those based upon mastery of 75
percent of the reading objectives. Table 7 shows agreement in 83.5 percent
of the cases. Even fewer Type II errors (14.9 percent) appear and only
slightly more Type I errors (1.6 percent). Again, we find that the short
test sorts students into mastery groups almost as efficiently as the longer
versions of the test.

Conclusions

The present paper does not attempt to include any evaluation of either
the MEAP Grade Four reading items, objectives, or reports. What we have
attempted to show is that a relationship exists between number of
objectives mastered, total test Standardized Log Achievement Score, and SLAS
derived from a ten item subset of the 95 items. Our motivation for
examining these relationships stems from three diverse areas of concern
about the current practices of MDE in the MEAP.

First, many districts find that there is little instructional use
for MEAP results since nearly all of their students "master" nearly all
of the objectives. These districts do, however, use MEAP results. They
use them to show that their students are at least acquiring "minimal
competencies." We are not in a position to evaluate this kind of use for
the data. We simply believe that essentially the same information could
be obtained by administering as few as five or ten items to students. Our
analysis lends a great deal of support to this contention.

TABLE 7

Relationship Between Standardized Log
Achievement Criterion Score on 10 Item Test
and Mastery of 75 Percent of Grade Four [Reading Objectives]

(Percent of Total)

                          Standardized Log Achievement Score
                                    10 Item Test
                        LE 225        GE 226        Total

Master 14                37.3           1.6          38.9
or fewer                (958)          (40)         (998)
objectives

Master 15                14.9          46.2          61.1
or more                 (383)        (1187)        (1570)
objectives

Total                    52.2          47.8         100.0
                       (1341)        (1227)        (2568)

Second, with more and more local, state and federal programs requiring
more and more evaluation data, testing time has become a major issue for
many educators. It is very important that time which is devoted to
testing be useful both for program evaluation and for instructional
purposes. Under classical test theory, testing data for one purpose are
usually not appropriate for the other. The Rasch model is a vehicle that
provides a theoretical framework within which students may be tested with
instruments appropriate for their achievement level, both in terms of
content and difficulty, and yet which yields data for comparative analysis.
Many responsible educators have proposed that some way be developed to
allow local educational agencies flexibility in terms of the content and
difficulty of the tests administered to their students. The current
investigation suggests that a core of as few as ten test items from the
present test could provide the MDE with essentially the same summary data
on the attainment of minimal competencies as is currently available.
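The value of matching test difficulty to achievement level can be illustrated with the Rasch item information function, p(1 - p), summed over the items of a form. The sketch below compares two hypothetical ten item forms for one student; the difficulty values, the student's ability, and the function names are all invented for illustration and do not come from the MEAP calibration.

```python
from math import exp, sqrt


def form_information(theta, difficulties):
    """Sum of Rasch item information p * (1 - p) at ability theta."""
    info = 0.0
    for d in difficulties:
        p = 1.0 / (1.0 + exp(d - theta))
        info += p * (1.0 - p)
    return info


def standard_error(theta, difficulties):
    """Approximate standard error of the ability estimate, in logits."""
    return 1.0 / sqrt(form_information(theta, difficulties))


# Hypothetical forms (not MEAP items): one far too easy, one well targeted.
easy_form = [-2.5 + 0.1 * i for i in range(10)]     # difficulties -2.5 .. -1.6
matched_form = [0.55 + 0.1 * i for i in range(10)]  # difficulties 0.55 .. 1.45
student_ability = 1.0                               # hypothetical, in logits

se_easy = standard_error(student_ability, easy_form)
se_matched = standard_error(student_ability, matched_form)
print(f"SE on easy form    = {se_easy:.2f} logits")
print(f"SE on matched form = {se_matched:.2f} logits")
```

For the same number of items, the form centered near the student's ability yields a noticeably smaller standard error, which is the sense in which a short, well-targeted test can carry as much information as a much longer easy one.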

Third, if a statewide item bank (such as is being developed in
MTSS) could be created following the Oregon model (which includes the
Rasch item difficulty estimate for every item that is placed in the bank),
the MDE could reduce the extent to which they might "dictate curriculum."
Even within the context of testing for "minimal competencies," LEA's
should be allowed to use achievement tests which reflect the content of
their curriculum. When items of known difficulty which cover a broad
range of content are made available to the educational community, LEA's
will be able to test for what they teach and the MDE will be able to
meaningfully summarize their data.

In summary, we believe that the approach outlined in this paper
provides a way to enhance the utility of MEAP. If the implications of
this investigation are acted upon, testing time could be drastically
reduced while allowing for the testing of more diverse instructional
content. Further exploration of these techniques for application in
[test] development and the establishment of criterion levels is needed.
However, our findings here, and in other investigations in progress,
suggest that the Rasch model is a very promising tool for understanding
the results of criterion-referenced tests.
APPENDIX A

MEASURED IN THE 1977-78
MICHIGAN EDUCATIONAL ASSESSMENT PROGRAM*

Grade 4

Objective
Number

 1.  2.1        Given a reading selection at the third grade level, the learner will
                match a series of words in the selection with appropriate definitions.

 2.  2.2        Given a set of phrases, the student will indicate those phrases
                which have the same meaning.

 3.  3.2        Given a reading selection at the third grade level in which every
                fifth word has been replaced with a blank, the learner will choose
                the exact word appropriate to the blank space at 50% accuracy.

 4.  4.1        Given a method of arranging data, the learner will identify the
                method (e.g., color, size, importance, time, etc.)

 5.  4.4        Given a series of randomly placed words, the learner will be able to
                alphabetize the words through the first three letters.

 6.  5.1        Given a series of reading selections, the learner will indicate those
                which are factual.

 7.  5.2        Given a series of reading selections, the learner will indicate those
                which are fictional.

 8.  6.1-6.3    Given a reading selection, the learner will be able to identify the
                author's purpose (e.g., persuasion, entertainment, propaganda,
                etc.)

 9.  7.1        Given a reading selection at the third grade level, the learner will
                select from a list of possible titles the one most appropriate as the
                title for that selection.

10.  7.2        Given a reading selection at the third grade level, the learner will
                select from a series of still pictures the one picture most
                appropriate in depicting the main idea of the selection.

11.  7.3        Given a reading selection at the third grade level, the learner will
                select from a number of short summaries the one which best
                summarizes the selection.

12.  8.4        Given a reading selection at the third grade level, the learner will
                match a series of direct quotations from the story with the
                character who is speaking.

13.  10.3       Given a reading selection at the third grade level, the learner will
                choose from a series of sentences that sentence which best
                describes how a given character feels in a story.

14.  10.6       Given a selection containing figurative language, the learner will
                identify from a series of descriptive phrases the phrase that most
                accurately describes the mood expressed in the selection.

15.  11.1       Given a reading selection at the third grade level, the learner will
                correctly match a series of causes with a corresponding series of
                effects.

16.  11.2       Given a reading selection at the third grade level with the
                conclusion of the story deleted, the learner will select from a series of
                possible conclusions the one most appropriate to the selection.

17.  13.1       Given a locational question, the learner will choose from a series of
                reference sources where that item will be found.

18.  13.2       Given a locational question about newspapers, the learner will
                select the section where the answer would be found.

19.  14.1-14.3  Given a reading selection at the third grade level, the learner will
                answer correctly a series of multiple choice questions relating to
                meanings, generalizations, or conclusions not expressed in the
                selection itself.

*This list contains only the objectives which are included in the every-pupil portion of
the 1977-78 MEAP tests. A complete set of the objectives is available in Minimal
Performance Objectives for Communication Skills Education in Michigan,
Michigan Department of Education.
LIST OF ITEMS MEASURING EACH FOURTH GRADE OBJECTIVE

Reading

Objective
Number        Item Number

 1            45, 52, 78, 81, 92
 2            83-87
 3            65-69
 4            16-20
 5            6-10
 6            27-31
 7            35-39
 8            24, 32, 33, 76, 98
 9            41, 53, 74, 89, 97
10            21, 40, 51, 70, 96
11            34, 43, 80, 90, 99
12            42, 48, 72, 77, 88
13            47, 49, 75, 79, 93
14            11-15
15            23, 44, 50, 91, 100
16            22, 46, 71, 82, 95
17            55-59

Mathematics

Objective
Number        Item Number

 1            196-200
 2            101-105
 3            241-245
 4            231-235
 5            226-230
 6            136-140
 7            176-180
 8            246-250
 9            111-115
10            166-170
11            116-120
12            156-160
13            151-155
14            146-150
15            236-240
16            191-195
17            121-125
18            171-175
19            211-215
20            251-255
21            106-110
22            161-165
23            1-5
24            206-210
25            126-130
26            201-205
27            141-145
28            186-190
29            216-220
30            221-225
31            256-260
32            181-185
33            131-135